How to represent a document?
What are vector space and bag-of-words models?
Features in text? And how to do text feature selection?
How to classify text data?
How to evaluate a classifier?
Represent by a string?
Represent by a list of sentences?
Represent by a vector?
A vector space is a collection of vectors
Represent documents by concept vectors
Each concept defines one dimension
k concepts define a high-dimensional space
Element of vector corresponds to concept weight
Distance between the vectors in this concept space
The process of converting text into numbers is called Vectorization
Terms are generic features that can be extracted from text
Typically, terms are single words, keywords, n-grams, or phrases
Documents are represented as vectors of terms
Each dimension (concept) corresponds to a separate term
\[d = (w_1, ..., w_n)\]
The Corpus represents a collection of documents (the dataset)
The Vocabulary is the set of all unique terms in the corpus
How to define/select the “basic concept”
How to assign weights
Bag of Words
Topics
Word Embeddings
With Bag of Words (BOW), we refer to a Vector Space Model where:
Terms: words (more generally we may use n-grams, etc.)
Weights: number of occurrences of the terms in the document
Term as the basis for vector space
Doc1: Text mining is to identify useful information.
Doc2: Useful information is mined from text.
Doc3: Apple is delicious.
Zipf’s law
The most frequent word occurs 90,000 times, the second one 45,000 times, the third 30,000, and so on.
In a clinical Dutch text dataset, if we know that the most frequent word occurs 69,548 times, what is your best estimate of the frequencies of the second and third most frequent words?
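Under Zipf's law, the \(r\)-th most frequent word occurs roughly \(1/r\) times as often as the most frequent one, so the estimate is a one-liner (a sketch, assuming the idealized law holds exactly):

```python
# Zipf's law: frequency of the rank-r word ~ (frequency of rank-1 word) / r
top = 69_548

second = top / 2  # rank 2
third = top / 3   # rank 3

print(round(second), round(third))
```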
Binary
TF
TF-IDF
Idea: a term is more important if it occurs more frequently in a document
TF Formulas
Let \(c(t,d)\) be the frequency count of term \(t\) in doc \(d\)
Raw TF: \(tf(t,d) = c(t,d)\)
Solution
Assign higher weights to rare terms
Formula
A corpus-specific property
Let \(n_{d,t}\) denote the number of times the \(t\)-th term appears in the \(d\)-th document.
\[TF_{d,t} = \frac{n_{d,t}}{\sum_i{n_{d,i}}}\]
Let \(N\) denote the number of documents and \(N_t\) denote the number of documents containing the \(t\)-th term.
\[IDF_t = log(\frac{N}{N_t})\]
TF-IDF weight:
\[w_{d,t} = TF_{d,t} \cdot IDF_t\]
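A minimal sketch of the TF-IDF formulas above on a toy corpus (the term counts are made up for illustration, and the natural logarithm is assumed):

```python
import math

# Toy corpus: one term-count dictionary per document (hypothetical counts).
docs = [
    {"text": 2, "mining": 1, "useful": 1},
    {"text": 1, "useful": 1},
    {"apple": 1},
]

N = len(docs)

def tf(term, doc):
    # TF_{d,t} = n_{d,t} / sum_i n_{d,i}
    return doc.get(term, 0) / sum(doc.values())

def idf(term):
    # IDF_t = log(N / N_t), with N_t = number of docs containing the term
    n_t = sum(1 for d in docs if term in d)
    return math.log(N / n_t)

def tfidf(term, doc):
    # w_{d,t} = TF_{d,t} * IDF_t
    return tf(term, doc) * idf(term)

print(tfidf("apple", docs[2]))  # rare term: high weight
print(tfidf("text", docs[0]))   # common term: lower weight
```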
If we remove one document from the corpus, how would it affect the IDF of words in the vocabulary?
If we add one document to the corpus, how would it affect the IDF of words in the vocabulary?
Euclidean distance
\(dist(d_i, d_j) = \sqrt{\sum_{t\in V}{[tf(t,d_i)idf(t) - tf(t, d_j)idf(t)]^2}}\)
Longer documents will be penalized by the extra words
We care more about how these two vectors are overlapped
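To see why overlap in direction matters more than length, a small sketch comparing Euclidean distance and cosine similarity on two vectors with identical direction (the toy vectors are hypothetical):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    # Cosine similarity: overlap in direction, independent of vector length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_doc = [1, 2, 0]
long_doc = [2, 4, 0]  # same term proportions, twice as long

print(euclidean(short_doc, long_doc))  # penalizes the extra length
print(cosine(short_doc, long_doc))     # 1.0: identical direction
```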
Supervised learning: Learning a function that maps an input to an output based on example input-output pairs.
infer a function from labeled training data
use the inferred function to label new instances
Human experts annotate a set of text data
Automatically classify political news from sports news
Recognizing spam emails
MEDLINE
Antagonists and Inhibitors
Blood Supply
Chemistry
Drug Therapy
Embryology
Epidemiology
…
Assigning subject categories, topics, or genres
Spam detection
Authorship identification
Age/gender identification
Language Identification
Sentiment analysis
…
Which of the following is least likely to be a text classification task?
Author’s gender detection from text
Finding about the smoking conditions of patients from clinical letters
Grouping news articles into political vs non-political news
Classifying reviews into positive and negative sentiment
Training set
Test set
Validation set (dev set)
https://scikit-learn.org/stable/modules/cross_validation.html
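A common way to obtain the three sets is two successive splits. A sketch with scikit-learn's train_test_split (the 60/20/20 proportions are an illustrative choice):

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 instances with binary labels.
X = list(range(100))
y = [i % 2 for i in X]

# First split off 40%, then halve it into validation (dev) and test: 60/20/20.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```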
Feature selection is the process of selecting a specific subset of the terms of the training set and using only them in the classification algorithm.
high dimensionality of text features
Select the most informative features for model training
Reduce noise in feature representation
Improve final classification performance
Improve training/testing efficiency
Less time complexity
Fewer training data
Wrapper method
Search in the whole space of feature groups
Consider all possible dependencies among the features
Impractical for text categorization
Cannot deal with large feature set
An NP-complete problem
Filter method
Evaluate the features independently from the classifier and other features
No indication of a classifier’s performance on the selected features
No dependency among the features
Feasible for very large feature set
Let \(p(c | t)\) be the conditional probability that a document belongs to class \(c\), given the fact that it contains the term \(t\). Therefore, we have:
\[\sum^k_{c=1}{p(c|t)} = 1\]
Then, the gini-index for the term \(t\), denoted by \(G(t)\) is defined as:
\[G(t) = \sum^k_{c=1}{p(c | t)^2}\]
The value of the gini-index lies in the range \((1/k, 1)\).
Higher values of the gini-index indicate a greater discriminative power of the term t.
If the global class distribution is skewed, the gini-index may not accurately reflect the discriminative power of the underlying attributes.
➔ normalized gini-index
\[p'(c|t) \equiv \frac{p(c|t)/p(c)}{\sum_{i=1}^k{p(i|t)/p(i)}} \]
\[G(t) \equiv \sum^k_{c=1}{p'(c|t)^2}\]
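A sketch of the plain and normalized gini-index computations above (the probabilities are made up for illustration):

```python
def gini(p_ct):
    # G(t) = sum_c p(c|t)^2; ranges from 1/k (uninformative) to 1 (pure)
    return sum(p * p for p in p_ct)

def normalized_gini(p_ct, p_c):
    # p'(c|t) = (p(c|t)/p(c)) / sum_i p(i|t)/p(i), then G(t) as above
    ratios = [pt / pc for pt, pc in zip(p_ct, p_c)]
    s = sum(ratios)
    return gini([r / s for r in ratios])

print(gini([0.9, 0.1]))   # ~0.82: discriminative term
print(gini([0.5, 0.5]))   # 0.5 = 1/k: uninformative term

# With a skewed class prior, a term whose conditional distribution merely
# mirrors the prior is not actually discriminative:
print(normalized_gini([0.9, 0.1], p_c=[0.9, 0.1]))  # back to 1/k
```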
The pointwise mutual information \(M_c(t)\) between the term \(t\) and the class \(c\) is defined on the basis of the level of co-occurrence between the class \(c\) and term \(t\). Let \(p(c)\) be the unconditional probability of class \(c\), and \(p(c | t)\) be the probability of class \(c\), given that the document contains the term \(t\).
Let \(p(t)\) be the fraction of the documents containing the term \(t\), i.e. the unconditional probability of term \(t\).
The expected co-occurrence of class \(c\) and term \(t\) on the basis of mutual independence is given by \(p(c) \cdot p(t)\). The true co-occurrence is of course given by \(p(c | t) \cdot p(t)\).
In practice, the value of \(p(c | t) \cdot p(t)\) may be much larger or smaller than \(p(c) \cdot p(t)\), depending upon the level of correlation between the class \(c\) and term \(t\). The mutual information is defined in terms of the ratio between these two values.
\[M_c(t) = log(\frac{p(c|t) \cdot p(t)}{p(c) \cdot p(t)}) = log(\frac{p(c|t)}{p(c)})\] Clearly, the term \(t\) is positively correlated to the class \(c\), when \(M_c (t) > 0\), and the term \(t\) is negatively correlated to class \(c\), when \(M_c (t) < 0\).
Note that \(M_c(t)\) is specific to a particular class \(c\). We need to compute the overall mutual information as a function of the mutual information of the term \(t\) with the different classes.
\[M_{avg}(t) = \sum^k_{c=1}{p(c) \cdot M_c(t)}\] \[M_{max}(t) = \max_c{\{M_c(t)\}}\] The second measure is particularly useful, when it is more important to determine high levels of positive correlation of the term \(t\) with any of the classes.
The \({\chi^2}\)-statistic is a different way to compute the lack of independence between the term \(t\) and a particular class \(c\). Let \(n\) be the total number of documents, then:
\[{\chi}_c^2(t) = \frac{n \cdot p(t)^2 \cdot (p(c|t) - p(c))^2}{p(t) \cdot (1- p(t)) \cdot p(c) \cdot (1 - p(c))}\]
As in the case of the mutual information, we can compute a global \({\chi^2}\) statistic from the class-specific values. One major advantage of the \({\chi^2}\)-statistic is that it is a normalized value and we can test statistical significance using the \({\chi^2}\) distribution with one degree of freedom.
Test whether distributions of two categorical variables are independent of one another
Degree of freedom = \((\#col-1) \times (\#row-1)\)
Significance level: \(\alpha\), i.e., \(p\mbox{-}value<\alpha\)
↳ Look into \({\chi}^2\) distribution table to find the threshold
\({\chi}^2\) statistics
For the features passing the threshold, rank them by descending order of \({\chi}^2\) values and choose the top \(k\) features
\({\chi}^2\) statistics with multiple categories
\({\chi}^2_{avg}(t)=\sum_c{p(c)\,{\chi}^2(c,t)}\)
\({\chi}^2_{max}(t) = \underset{c}{\max}\ {\chi}^2(c,t)\)
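In practice, scikit-learn's chi2 scorer combined with SelectKBest implements exactly this rank-and-keep-top-\(k\) scheme. A sketch on a toy term-count matrix (the counts and labels are made up):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy term-document counts: 4 documents x 3 terms.
# The third term occurs only in class-1 documents, so it should score highest.
X = np.array([
    [3, 1, 0],
    [2, 0, 0],
    [0, 1, 4],
    [1, 0, 5],
])
y = [0, 0, 1, 1]

# Score every term with the chi^2 statistic, then keep the top-2 terms.
selector = SelectKBest(chi2, k=2).fit(X, y)

print(selector.scores_)                    # chi^2 score per term
print(selector.get_support(indices=True))  # indices of the selected terms
```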
Many other metrics (Same trick as in \(\chi^2\) statistics for multi-class cases)
Mutual information
\[PMI(t;c) = log(\frac{p(t,c)}{p(t)p(c)})\]
Odds ratio
\[Odds(t;c) = \frac{p(t,c)}{1 - p(t,c)} \times \frac{1 - p(t,\bar{c})}{p(t,\bar{c})}\]
Input:
a document \(d\)
a fixed set of classes \(C = \{c_1, c_2,…, c_J\}\)
Output: a predicted class \(c \in C\)
Rules based on combinations of words or other features
Accuracy can be high: If rules carefully refined by expert
But building and maintaining these rules is expensive
Data/Domain specifics
Input:
a document \(d\)
a fixed set of classes \(C = \{c_1, c_2,…, c_J\}\)
A training set of \(m\) hand-labeled documents \((d_1,c_1),\cdots,(d_m,c_m)\)
Output:
Any kind of classifier
Naïve Bayes
Logistic regression
Support-vector machines
k-Nearest Neighbors
Neural Networks
Each class is represented by its centroid: test samples are classified to the class with the nearest centroid. Using a training set of documents, the Rocchio algorithm builds a prototype vector (centroid) for each class. This prototype is the average of the vectors of the training documents that belong to that class.
\[\boldsymbol{\mu_c} = \frac{1}{|D_c|}\sum_{\mathbf{d} \in D_c}{\mathbf{d}}\]
Where \(D_c\) is the set of documents in the corpus that belong to class \(c\) and \(\mathbf{d}\) is the vector representation of document \(d\).
The predicted label of document \(d\) is the class whose centroid has the smallest (Euclidean) distance to the document.
\[\hat{c} = \arg \min_c ||\boldsymbol{\mu_c} - \mathbf{d}||\]
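scikit-learn's NearestCentroid implements this scheme. A sketch on hypothetical 2-D document vectors:

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

# Toy 2-D document vectors for two classes (made-up values).
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = ["sports", "sports", "politics", "politics"]

# fit() computes one centroid (mean vector) per class;
# predict() assigns each document to the class of the nearest centroid.
clf = NearestCentroid()
clf.fit(X, y)

print(clf.centroids_)             # class prototype vectors
print(clf.predict([[0.8, 0.2]]))  # nearest centroid wins
```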
Given a test document \(d\), the kNN algorithm finds the \(k\) nearest neighbors of \(d\) among all the documents in the training set, and scores the candidate categories based on the classes of those \(k\) neighbors. After sorting the score values, the algorithm assigns the document to the class with the highest score.
The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of the nearest neighbors. Under some circumstances, it is better to weight the neighbors such that nearer neighbors contribute more to the fit.
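A sketch of distance-weighted kNN with scikit-learn's KNeighborsClassifier (the 1-D "documents" are a toy illustration; `weights="distance"` makes nearer neighbors count more than farther ones):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D training points with binary labels.
X = [[0.0], [0.1], [0.2], [1.0], [1.1]]
y = [0, 0, 0, 1, 1]

# weights="distance": each neighbor's vote is weighted by 1/distance,
# instead of the uniform majority vote of plain kNN.
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn.fit(X, y)

print(knn.predict([[0.9]]))   # close to the class-1 cluster
print(knn.predict([[0.15]]))  # inside the class-0 cluster
```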
Simple (“naïve”) classification method based on Bayes rule
Relies on very simple representation of document
\[P(c|d) = \frac{P(d|c)P(c)}{P(d)}\]
\[P(x_1, x_2, \ldots, x_n|c)\]
Bag of Words assumption: Assume position doesn’t matter
Conditional Independence: Assume the feature probabilities \(P(x_i|c_j)\) are independent given the class \(c\).
\[P(x_1, \ldots, x_n|c) = P(x_1 | c) \cdot P(x_2|c) \cdot P(x_3|c) \cdot \ldots \cdot P(x_n|c)\]
\[C_{MAP} = \underset{c \in C}{\operatorname{argmax}}P(x_1, x_2, \ldots,x_n|c)P(c)\]
\[C_{NB} = \underset{c \in C}{\operatorname{argmax}}P(c)\prod_{x \in X}{P(x|c)}\] \[C_{NB} = \underset{c_j \in C}{\operatorname{argmax}}P(c_j)\prod_{i \in positions}{P(x_i|c_j)}\]
First attempt: maximum likelihood estimates
\[\hat{P}(c_j) = \frac{doccount(C = c_j)}{N_{doc}}\] \[\hat{P}(w_i|c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V}count(w, c_j)}\]
fraction of times word \(w_i\) appears among all words in documents of topic \(c_j\)
Create mega-document for topic \(j\) by concatenating all docs in this topic
What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?
\[\hat{P}(\mathrm{"fantastic"|positive}) = \frac{count(\mathrm{"fantastic", positive})}{\sum_{w \in V}{count(\mathrm{w,positive})}}\] Zero probabilities cannot be conditioned away, no matter the other evidence!
\[C_{MAP} = \underset{c}{\operatorname{argmax}}\hat{P}(c)\prod_i{\hat{P}(x_i|c)}\]
\[ \begin{align} \hat{P}(w_i|c) &= \frac{count(w_i,c)+1}{\sum_{w \in V}{(count(w,c)+1)}} \\ &= \frac{count(w_i,c)+1}{(\sum_{w \in V}{count(w,c) + |V|})} \end{align} \]
From training corpus, extract Vocabulary
Calculate \(P(c_j)\) terms
docs \(docs_j\leftarrow\) all docs with class = \(c_j\)
\[P(c_j) \leftarrow \frac{|docs_j|}{N_{doc}}\]
Calculate \(P(w_k|c_j)\) terms
\(Text_j \leftarrow\) single doc containing all \(docs_j\)
For each word \(w_k\) in Vocabulary
\(n_k \leftarrow\) # of occurrences of \(w_k\) in \(Text_j\)
Add one extra word to the vocabulary, the “unknown word” \(w_u\)
\[ \begin{align} \hat{P}(w_u|c) &= \frac{count(w_u,c)+1}{(\sum_{w \in V}{count(w,c)})+|V|+1} \\ &= \frac{1}{(\sum_{w \in V}{count(w,c)}) + |V|+1} \end{align} \]
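The training procedure above, with add-one (Laplace) smoothing, can be sketched in a few lines (the tiny labeled corpus is made up; unknown words simply fall back to the smoothed zero count):

```python
import math
from collections import Counter

# Tiny hypothetical training corpus: (tokens, class) pairs.
train = [
    (["fantastic", "great", "movie"], "pos"),
    (["great", "plot"], "pos"),
    (["boring", "movie"], "neg"),
]

classes = {c for _, c in train}
vocab = {w for doc, _ in train for w in doc}
doc_counts = Counter(c for _, c in train)

# Concatenate all docs of each class into one "mega-document" of counts.
counts = {c: Counter() for c in classes}
for doc, c in train:
    counts[c].update(doc)

def log_prob(doc, c):
    # log P(c) + sum_i log P(w_i|c), with add-one smoothing
    lp = math.log(doc_counts[c] / len(train))
    total = sum(counts[c].values())
    for w in doc:
        lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
    return lp

def predict(doc):
    return max(classes, key=lambda c: log_prob(doc, c))

print(predict(["fantastic", "plot"]))
print(predict(["boring"]))
```

Working in log space avoids the floating-point underflow that multiplying many small probabilities would cause.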
What proportion of instances is correctly classified?
(TP + TN) / (TP + FP + FN + TN)
Accuracy is a valid choice of evaluation for classification problems which are well balanced and not skewed.
Suppose the target class is very sparse. Do we still want accuracy as our performance metric? If we are predicting whether an asteroid will hit the Earth, a model that just says “No” all the time will be about 99% accurate, yet completely useless.
Precision: % of selected items that are correct
Recall: % of correct items that are selected
Precision is a valid choice of evaluation metric when we want to be very sure of our prediction.
Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.
A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):
\[F = \frac{1}{\alpha\frac{1}{P}+(1-\alpha)\frac{1}{R}}=\frac{(\beta^2+1)PR}{\beta^2P + R}\]
The harmonic mean is a very conservative average; see IIR § 8.3
People usually use balanced F1 measure - i.e., with \(\beta = 1\) (that is, \(\alpha = 1/2\)): \(F = 2PR/(P+R)\)
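A sketch of these metrics computed from raw confusion-matrix counts (the counts are hypothetical):

```python
def prf(tp, fp, fn):
    # Precision: fraction of selected items that are correct
    p = tp / (tp + fp)
    # Recall: fraction of correct items that are selected
    r = tp / (tp + fn)
    # Balanced F measure (beta = 1): harmonic mean of P and R
    f1 = 2 * p * r / (p + r)
    return p, r, f1

p, r, f1 = prf(tp=8, fp=2, fn=4)
print(p, r, f1)
```

Because the harmonic mean is dominated by the smaller of P and R, a classifier cannot score well on F1 by maximizing one at the expense of the other.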
AUC provides an aggregate measure of performance across all possible classification thresholds.
One way of interpreting AUC is as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming ‘positive’ ranks higher than ‘negative’). AUC measures how well predictions are ranked, rather than their absolute values.
AUC ranges in value from 0 (100% wrong), through 0.5 (a random classifier), to 1 (100% correct). For example, if you as a marketer want a list of users likely to respond to a marketing campaign, AUC is a good metric: ranking users by predicted probability gives exactly the order in which to contact them.
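A sketch of the ranking interpretation with scikit-learn's roc_auc_score (the labels and scores are made up): of the four positive/negative pairs below, three are ranked correctly, so AUC = 3/4.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC = probability that a randomly chosen positive is scored higher than a
# randomly chosen negative. Pairs: (0.35,0.1) ok, (0.35,0.4) wrong,
# (0.8,0.1) ok, (0.8,0.4) ok -> 3/4.
print(roc_auc_score(y_true, y_score))
```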
Gee, I’m building a text classifier for real, now!
What should I do?
If (wheat or grain) and not (whole or bread) then Categorize as grain
Need careful crafting
Human tuning on development data
Time-consuming: 2 days per class
Use Naïve Bayes
Get more labeled data
Try semi-supervised training methods:
Perfect for all the clever classifiers
SVM
Regularized Logistic Regression
You can even use user-interpretable decision trees
Users like to hack
Management likes quick fixes
Can achieve high accuracy!
At a cost:
SVMs (train time) or kNN (test time) can be too slow
Regularized logistic regression can be somewhat better
So Naïve Bayes can come back into its own again!
With enough data
Domain-specific features and weights: very important in real performance
Sometimes need to collapse terms:
Part numbers, chemical formulas, …
But stemming generally doesn’t help
Upweighting: Counting a word as if it occurred twice:
title words
first sentence of each paragraph (Murata, 1999)
In sentences that contain title words
Corpus: a large and structured set of texts
Stop words: words which are filtered out before or after processing of natural language data (text)
Unstructured text: information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
Tokenizing: process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens (see also lexical analysis)
Natural language processing: field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages
Term document (or document term) matrix: a mathematical matrix that describes the frequency of terms that occur in a collection of documents
Supervised learning: the machine learning task of inferring a function from labeled training data
Unsupervised learning: find hidden structure in unlabeled data
Stemming: the process for reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form
Vector space model & BOW
Feature Selection
Text Classification
Evaluation